Abstract



We propose a new data mining approach to ranking documents based on the concept of
cone-based generalized inequalities between vectors. A partial ordering between two vectors
is defined with respect to a proper cone, and thus learning the preferences is formulated
as learning proper cones. A pairwise learning-to-rank algorithm (ConeRank) is proposed
to learn a non-negative subspace, formulated as a polyhedral cone, over document-pair
differences. The algorithm is regularized by controlling the 'volume' of the cone.
Experimental studies on the latest and largest ranking dataset, LETOR 4.0, show that
ConeRank is competitive against other recent ranking approaches.

1 Introduction

Learning to rank in information retrieval (IR) is an emerging subject [7] [11] [HI SI IS] with great
promise to improve the retrieval results by applying machine learning techniques to learn the
document relevance with respect to a query. Typically, the user submits a query and the system
returns a list of related documents. We would like to learn a ranking function that outputs the
position of each returned document in decreasing order of relevance.

Generally, the problem can be studied in the supervised learning setting, in that for each
query-document pair, there is an extracted feature vector and a position label in the ranking.
The feature can be either query-specific (e.g. the number of matched keywords in the document
title) or query-independent (e.g. the PageRank score of the document, number of in-links and
out-links, document length, or the URL domain). In the training data, we have a ground-truth
ranking per query, which can be in the form of a relevance score assigned to each document, or
an ordered list in decreasing level of relevance.

The learning-to-rank problem has been approached from different angles: treating
the ranking problem as ordinal regression [10] [6], in which an ordinal label is assigned to a
document, as pairwise preference classification [11] H], or as a listwise permutation problem
[HE].

We focus on the pairwise approach, in which ordered pairs of documents per query are
treated as training instances and, in testing, predicted pairwise orders within a query are
combined to make a final ranking. The advantage of this approach is that many existing
powerful binary classifiers can be adapted with minimal changes: SVM [11], boosting [9],
or logistic regression [I] are some choices.

We introduce an entirely new perspective based on the concept of cone-based generalized
inequality. More specifically, the inequality between two multidimensional vectors is defined
with respect to a cone. Recall that a cone is a geometrical object such that if two vectors belong
to the cone, then any non-negative linear combination of the two vectors also belongs to the
cone. Translated into the framework of our problem, this means that given a cone $\mathcal{K}$, when
document $l$ is ranked higher than document $m$, the feature vector $x_l$ is 'greater' than the feature
vector $x_m$ with respect to $\mathcal{K}$ if $x_l - x_m \in \mathcal{K}$. Thus, given a cone, we can find the correct order
of preference for any given document pair. However, since the cone $\mathcal{K}$ is not known in advance,
it needs to be estimated from the data. Thus, in our paper, we consider polyhedral cones
constructed from basis vectors and propose a method for learning the cones via the estimation
of this set of basis vectors.

This paper makes the following contributions: 

• A novel formulation of the learning to rank problem, termed as ConeRank, from the angle 
of cone learning and generalized inequalities; 

• A study on the generalization bounds of the proposed method; 

• Efficient online cone learning algorithms, scalable with large datasets; and, 

• An evaluation of the algorithms on the latest LETOR 4.0 benchmark dataset (see footnote 1).



2 Previous Work 

Learning-to-rank is an active topic in machine learning, although ranking and permutations
have been studied widely in statistics. One of the earliest papers in machine learning is perhaps
[7]. The seminal paper [11] stimulated much subsequent research. Machine learning methods
extended to ranking can be divided into:

Pointwise approaches, which include methods such as ordinal regression [10] [6]. Each
query-document pair is assigned an ordinal label, e.g. from the set {0, 1, 2, ..., L}. This simplifies
the problem as we do not need to worry about the exponential number of permutations. The
complexity is therefore linear in the number of query-document pairs. The drawback is that the
ordering relation between documents is not explicitly modelled.

Pairwise approaches, which reduce preference learning to binary classification [11] [9], H], where
the goal is to learn a classifier that can separate two documents (per query). This casts the
ranking problem into a standard classification framework, for which many algorithms are readily
available. The complexity is quadratic in the number of documents per query and linear in the
number of queries.

Listwise approaches, which model the distribution of permutations [5]. The ultimate goal is
to model a full distribution over all permutations, and the prediction phase outputs the most
probable permutation. In the statistics community, this problem has long been addressed [14]
from a different angle. The main difficulty is that the number of permutations is exponential
and thus approximate inference is often used.

However, in IR, the evaluation criteria often differ from those employed in learning. So
there is a trend to optimize the IR metrics directly, or approximations of and bounds on them [Hj.

1 Available at: http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx

Figure 1: Illustration of ConeRank. Here the pairwise differences are distributed in 3-dimensional
space, most of which, however, lie only on a surface and can be captured most
effectively by a 'minimum' cone plotted in green. Red stars denote noisy samples.

3 Proposed Method



3.1 Problem Settings 

We consider a training set of $P$ queries $q_1, q_2, \ldots, q_P$ randomly sampled from a query space
$Q$ according to some distribution $P_Q$. Associated with each query $q$ is a set of documents
represented as pre-processed feature vectors $\{x_1^q, x_2^q, \ldots\}$, $x_l^q \in \mathbb{R}^N$, with relevance scores
$r_1^q, r_2^q, \ldots$ on which a ranking over documents can be based. We note that the values of the feature vectors
may be query-specific and thus the same document can have different feature vectors for
different queries. Document $x_l^q$ is said to be more preferred than document $x_m^q$ for a given
query $q$ if $r_l^q > r_m^q$, and vice versa. In the pairwise approach pursued in this paper, we equivalently
learn a ranking function $f$ that takes as input a pair of different documents $x_l^q, x_m^q$ for a
given query $q$ and returns a value $y \in \{+1, -1\}$, where $+1$ corresponds to the case where $x_l^q$
is ranked above $x_m^q$ and vice versa. For notational simplicity, we may drop the superscript $q$
where there is no confusion.
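
As an illustration of this setting, the following minimal sketch builds the ordered document-pair differences for a single query. The function name `ordered_pair_differences` and the choice to skip tied pairs are assumptions made for this example, not part of the original method description.

```python
from itertools import combinations

def ordered_pair_differences(features, relevance):
    """Return (z, gap, (l, m)) for every document pair with r_l > r_m.

    features  : list of feature vectors x_l for one query
    relevance : list of relevance scores r_l, aligned with `features`
    """
    pairs = []
    for a, b in combinations(range(len(features)), 2):
        if relevance[a] == relevance[b]:
            continue  # assumption: ties carry no preference and are skipped
        # order the pair so that document l is the more relevant one
        l, m = (a, b) if relevance[a] > relevance[b] else (b, a)
        z = [xl - xm for xl, xm in zip(features[l], features[m])]
        pairs.append((z, relevance[l] - relevance[m], (l, m)))
    return pairs

docs = [[1.0, 0.5], [0.2, 0.1], [0.9, 0.4]]
scores = [2, 0, 1]
for z, gap, (l, m) in ordered_pair_differences(docs, scores):
    print(f"doc {l} preferred over doc {m}: z={z}, relevance gap={gap}")
```

Each returned difference corresponds to the label $y = +1$; swapping the two documents would give $-z$ and $y = -1$.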

3.2 Ranking as Learning Generalized Inequalities 

In this work, we consider the ranking problem from the viewpoint of generalized inequalities. 
In convex optimization theory [5, p. 34], a generalized inequality denotes a partial ordering
induced by a proper cone $\mathcal{K}$, which is convex, closed, solid, and pointed:

$x \preceq_{\mathcal{K}} y \iff y - x \in \mathcal{K}.$

Generalized inequalities satisfy many properties such as preservation under addition, transitiv- 
ity, preservation under non-negative scaling, reflexivity, anti-symmetry, and preservation under 
limit. 

We propose to learn a generalized inequality or, equivalently, a proper cone $\mathcal{K}$ that best
describes the training data (see Fig. 1 for an illustration). Our important assumption is that this
proper cone, which induces the generalized inequality, is not query-specific and thus prediction
can be used for unseen queries and document pairs coming from the same distributions.

From a fundamental property of convex cones, if $z \in \mathcal{K}$ then $wz \in \mathcal{K}$ for all $w \ge 0$, and any
non-negative combination of the cone elements also belongs to the cone, i.e. if $u_k \in \mathcal{K}$ then
$\sum_k w_k u_k \in \mathcal{K}$, $\forall w_k \ge 0$.

In this work, we restrict our attention to polyhedral cones for the learning of generalized
inequalities. A polyhedral cone is both a polyhedron and a cone. It can be defined
as a sum of rays or as an intersection of halfspaces. We construct the polyhedral cone $\mathcal{K}$ from 'basis'
vectors $U = [u_1, u_2, \ldots, u_K]$. They are the extreme vectors lying on the intersection of hyperplanes
that define the halfspaces. Thus, the cone $\mathcal{K}$ is the conic hull of the basis vectors and is
completely specified if the basis vectors are known. A polyhedral cone with $K$ basis vectors is
said to have order $K$ if no basis vector can be expressed as a conic combination of the
others. It can be verified that under these regular conditions, a polyhedral cone is a proper
cone and thus can induce a generalized inequality. We thus propose to learn the basis vectors
$u_k$, $k = 1, \ldots, K$, for the characterization of $\mathcal{K}$.

A projection of $z$ onto the cone $\mathcal{K}$, denoted by $P_{\mathcal{K}}(z)$, is generally defined as some $z' \in \mathcal{K}$
such that a certain criterion on the distance between $z$ and $z'$ is met. As $z' \in \mathcal{K}$, it follows
that it admits a conic representation $z' = \sum_{k=1}^{K} w_k u_k = Uw$, $w_k \ge 0$. By restricting the order
$K < N$, it can be shown that when $U$ is full-rank the conic representation is unique.

Define an ordered document-pair $(l, m)$ difference as $z = x_l - x_m$ where, without loss of
generality, we assume that $r_l > r_m$. The linear representation of $z' \in \mathcal{K}$ can be found from

$\min_{w} \|z - Uw\|_2^2, \quad w \ge 0, \qquad (1)$

where the inequality constraint is element-wise. It can be seen that $P_{\mathcal{K}}(z) = z$, $\forall z \in \mathcal{K}$.
Otherwise, if $z \notin \mathcal{K}$, then it can be easily proved by contradiction that the solution $w$ is such
that $Uw$ lies on a facet of $\mathcal{K}$. Let $\mathcal{K}^-$ be the cone with the basis $-U$; then it can be easily
shown that if $z \in \mathcal{K}^-$ then $P_{\mathcal{K}}(z) = 0$.
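
The projection in (1) is a non-negative least-squares problem. The sketch below approximates it with plain projected gradient descent; the solver choice, the step size and the names `cone_projection` and `pair_loss` are assumptions made for illustration, and any non-negative least-squares routine could be substituted.

```python
def cone_projection(z, basis, steps=2000):
    """Approximate P_K(z) = U w*, where w* minimises ||z - U w||_2^2 over w >= 0.

    basis : list of K basis vectors u_k, each of length N (the columns of U)
    """
    K, N = len(basis), len(z)
    # 1 / ||U||_F^2 is a conservative step size for the objective 0.5 * ||U w - z||^2
    step = 1.0 / sum(v * v for u in basis for v in u)
    w = [0.0] * K
    for _ in range(steps):
        recon = [sum(w[k] * basis[k][n] for k in range(K)) for n in range(N)]
        resid = [r - t for r, t in zip(recon, z)]
        grad = [sum(a * b for a, b in zip(basis[k], resid)) for k in range(K)]
        # gradient step followed by projection onto the non-negative orthant
        w = [max(0.0, w[k] - step * grad[k]) for k in range(K)]
    proj = [sum(w[k] * basis[k][n] for k in range(K)) for n in range(N)]
    return proj, w

def pair_loss(z, basis):
    """Squared l2 distance from z to the cone spanned by `basis`."""
    proj, _ = cone_projection(z, basis)
    return sum((a - b) ** 2 for a, b in zip(z, proj))

# 2D cone between the x-axis and the diagonal
U = [[1.0, 0.0], [1.0, 1.0]]
print(pair_loss([2.0, 1.0], U))    # z inside the cone
print(pair_loss([0.0, 1.0], U))    # z outside: projected onto the diagonal facet
print(pair_loss([-1.0, -0.5], U))  # z in the cone with basis -U: projected to the origin
```

The three calls correspond to the three cases discussed above: a point inside the cone, a point whose projection lies on a facet, and a point in $\mathcal{K}^-$.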

Returning to the ranking problem, we need to find a $K$-degree polyhedral cone $\mathcal{K}$ that
captures most of the training data. Define the $\ell_2$ distance from $z$ to $\mathcal{K}$ as $d_{\mathcal{K}}(z) = \|z - P_{\mathcal{K}}(z)\|_2$;
then we define the document-pair-level loss as

$l(\mathcal{K}; z, y) = d_{\mathcal{K}}(z)^2. \qquad (2)$

Suppose that for a query $q$, a set of document-pair differences $S_q = \{z_1^q, \ldots, z_{n_q}^q\}$ with relevance
differences $\phi_1^q, \ldots, \phi_{n_q}^q$, $\phi_j^q > 0$, can be obtained. Following [13], we define the empirical
query-level loss as

$L(\mathcal{K}; q, S_q) = \frac{1}{n_q} \sum_{j=1}^{n_q} l(\mathcal{K}; z_j^q, y_j^q). \qquad (3)$

For a full training set of $P$ queries and $S = \{S_{q_1}, \ldots, S_{q_P}\}$ samples, we define the query-level
empirical risk as

$\hat{R}(\mathcal{K}; S) = \frac{1}{P} \sum_{i=1}^{P} L(\mathcal{K}; q_i, S_{q_i}). \qquad (4)$

Thus, the polyhedral cone $\mathcal{K}$ can be found by minimizing this query-level empirical risk.
Note that even though other performance measures such as mean average precision (MAP)
or normalized discounted cumulative gain (NDCG) are the ultimate assessment, it is observed
that good empirical risk often leads to good MAP/NDCG and simplifies the learning. We next
discuss some additional constraints for the algorithm to achieve good generalization ability.



3.3 Modification 

Normalization. Using the proposed approach, the direction of the vector $z$ is more important
than its magnitude. However, at the same time, if the magnitude of $z$ is small it is desirable
to suppress its contribution to the objective function. We thus propose the normalization of
input document-pair differences as follows

$z \leftarrow \rho z / (\alpha + \|z\|_2), \quad \alpha, \rho > 0. \qquad (5)$

The constant $\rho$ is simply the scaling factor whilst $\alpha$ is to suppress the noise when $\|z\|_2$ is too
small. With this normalization, we note that

$\|z\|_2 \le \rho. \qquad (6)$
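
The normalization (5) and the bound (6) can be checked numerically with a few lines. The function name `normalize_pair` and the default values of `alpha` and `rho` are placeholders chosen for this sketch.

```python
import math

def normalize_pair(z, alpha=1.0, rho=1.0):
    """Apply z <- rho * z / (alpha + ||z||_2) with alpha, rho > 0, as in (5)."""
    norm = math.sqrt(sum(v * v for v in z))
    scale = rho / (alpha + norm)
    return [scale * v for v in z]

big = normalize_pair([3.0, 4.0], alpha=1.0, rho=2.0)
small = normalize_pair([1e-6, 0.0], alpha=1.0, rho=2.0)
print(math.hypot(*big))    # stays below rho = 2, as stated in (6)
print(math.hypot(*small))  # a tiny difference stays tiny instead of being blown up to norm rho
```

A large difference is mapped close to norm $\rho$, while a very small (likely noisy) difference keeps a small norm, which is the suppression effect attributed to $\alpha$.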

Relevance weighting. In the current setting, we consider all ordered document-pairs equally
important. This is however a disadvantage because the cost of a mismatch between two
vectors that are close in rank is less than the cost between those distant in rank. To address
this issue, we propose an extension of (2)

$l(\mathcal{K}; z, y) = \phi \, d_{\mathcal{K}}(z)^2, \qquad (7)$

where $\phi > 0$ is the corresponding ordered relevance difference.

Conic regularization. From statistical learning theory [15, ch. 4], it is known that in order to
obtain good generalization bounds, it is important to restrict the hypothesis space from which
the learned function is to be found. Otherwise, the direct solution from an unconstrained
empirical risk minimization problem is likely to overfit and introduce large variance (uncertainty).
In many cases, this translates to controlling the complexity of the learning function. In the
case of support vector machines (SVMs), this has the intuitive interpretation of maximizing
the margin, which is the inverse of the norm of the learning function in the Hilbert space.

In our problem, we seek a cone which captures most of the training examples, i.e. the
cone that encloses the conic hull of most training samples. In the SVM case, there are many
possible hyperplanes that separate the samples without a controlled margin. Similarly, there is
also a large number of polyhedral cones that can capture the training samples without further
constraints. In fact, minimizing the empirical risk will tend to select the cone with larger
solid angle so that the training examples will have small loss (see Fig. 2). In our case, the
complexity is translated roughly to the size (volume) of the cone. A bigger cone will likely
overfit (enclose) the noisy training samples and thus reduce generalization. Thus, we propose
the following constraint to indirectly regularize the size of the cone

$0 \le \lambda_l \le \|w\|_1 \le \lambda_u, \quad w \ge 0, \qquad (8)$

where $w$ is the coefficient vector defined as in (1) and for simplicity we set $\lambda_l = 1$. To see how this
effectively controls $\mathcal{K}$, consider a 2D toy example in Fig. 2. If $\lambda_u = 1$, the solution is the cone $\mathcal{K}_1$.
In this case, the loss of the positive training examples (within the cone) is the distance from them
to the simplex defined over the basis vectors $u_1, u_2$ (i.e. $\{z : z = \lambda u_1 + (1 - \lambda) u_2,\ 0 \le \lambda \le 1\}$)
and the loss of the negative training example is the distance to the cone. With the same
training examples, if we let $\lambda_u > 1$ then there exists a cone solution $\mathcal{K}_2$ such that all the
losses are effectively zero. In particular, for each training example, there exists a corresponding
$\|w\|_1 = \lambda$ such that the corresponding simplex $\{z : z = w_1 u_1 + w_2 u_2,\ w_1 + w_2 = \lambda\}$ passes through all
positive training examples.

Finally, we note that as the product $U w_j^q$ appears in the objective function and both
$U$ and $w_j^q$ are variables, there is a scaling ambiguity in the formulation. We suggest
addressing this scale ambiguity by imposing the norm constraint $\|u_k\|_2 = c > 0$ on the basis
vectors.

In summary, the proposed formulation can be explicitly written as

$\min_{U, \{w_j^{q_i}\}} \ \frac{1}{P} \sum_{i=1}^{P} \frac{1}{n_{q_i}} \sum_{j=1}^{n_{q_i}} \phi_j^{q_i} \|z_j^{q_i} - U w_j^{q_i}\|_2^2 \qquad (9)$
s.t. $\|u_k\|_2 = c$, $w_j^{q_i} \ge 0$, $0 \le \lambda_l \le \|w_j^{q_i}\|_1 \le \lambda_u$.

3.4 Generalization bound

We restrict our study of the generalization bound to an algorithmic stability viewpoint, which was
initially introduced in [2] and is based on the concentration property of random variables. In the
ranking context, generalization bounds for point-wise ranking / ordinal regression have been
obtained [U |8]. Recently, [13] showed that the generalization bound result in [2] still holds in
the ranking context. More specifically, we would like to study the variation of the expected
query-level risk, defined as

$R(\mathcal{K}) = \int_{Q \times \mathcal{Y}} L(\mathcal{K}; q) \, P_Q(dq), \qquad (10)$

where $L(\mathcal{K}; q)$ denotes the expected query-level loss defined as

$L(\mathcal{K}; q) = \int l(\mathcal{K}; z^q, y^q) \, P_Z(dz^q), \qquad (11)$

and $P_Z$ denotes the probability distribution of the (ordered) document differences.

Figure 2: Illustration of different cone solutions. For simplicity, we plot for the case $c = 1$ and
$\|z\|_2 \approx 1$.

Following [2] and [13] we define the uniform leave-one-query-out document-pair-level stability
as

$\beta = \sup_{q \in Q,\ i \in [1, \ldots, P]} \left| l(\mathcal{K}_S; z^q, y^q) - l(\mathcal{K}_{S^{-i}}; z^q, y^q) \right| \qquad (12)$

where $\mathcal{K}_S$ and $\mathcal{K}_{S^{-i}}$ are respectively the polyhedral cones learned from the full training set and
from the training set without the $i$th query. As stated in [13], the following query-level
stability bounds can easily be shown by integrating or averaging the term on the left-hand side in the above
definition

$|L(\mathcal{K}_S; q) - L(\mathcal{K}_{S^{-i}}; q)| \le \beta, \ \forall i, \qquad (13)$
$|L(\mathcal{K}_S; q, S_q) - L(\mathcal{K}_{S^{-i}}; q, S_q)| \le \beta, \ \forall i. \qquad (14)$

Using the above query-level stability results and by considering S qi as query-level samples, one 
can directly apply the result in [2] (see also [13]) to obtain the following generalization bound 

Theorem 1 For the proposed ConeRank algorithm with uniform leave-one-query-out
document-pair-level stability $\beta$, with probability of at least $1 - \varepsilon$ it holds that

$R(\mathcal{K}_S) \le \hat{R}(\mathcal{K}_S) + 2\beta + (4P\beta + \gamma) \sqrt{\frac{\ln(1/\varepsilon)}{2P}} \qquad (15)$

where $\hat{R}(\mathcal{K}_S)$ is the query-level empirical risk (4), $\gamma = \sup_{q \in Q} l(\mathcal{K}_S; z^q, y^q)$ and $\varepsilon \in [0, 1]$.

As can be seen, the bound on the expected query-level risk depends on the stability. It is
of practical interest to study the stability $\beta$ for the proposed algorithm. The following result
shows that the change in the cone due to leaving one query out can provide an effective upper
bound on the uniform stability $\beta$. For notational simplicity, we only consider the non-weighted
version of the loss, as the weighted version simply scales the bound by the maximum
weight.

Theorem 2 Denote by $U$ and $U^{-i}$ the 'basis' vectors of the polyhedral cones $\mathcal{K}_S$ and $\mathcal{K}_{S^{-i}}$
respectively. For a ConeRank algorithm with non-weighted loss, we have

$\beta \le 2 s_{\max} \lambda_u (\rho + \sqrt{K} c \lambda_u) + s_{\max}^2 \lambda_u^2, \qquad (16)$

where $s_{\max} = \max_i \|U - U^{-i}\|$, $\|\cdot\|$ denotes the spectral norm, and $\rho$ is the normalizing factor
of $z$ (c.f. (5)).

Proof. Following the proposed algorithm, we equivalently study the bound of

$\beta = \sup_{q \in Q,\ \|z^q\|_2 \le \rho} \left| \min_{w \in C} \|z^q - Uw\|_2^2 - \min_{w \in C} \|z^q - U^{-i} w\|_2^2 \right|$

where the constraint set $C = \{w : w \ge 0,\ \lambda_l \le \|w\|_1 \le \lambda_u\}$. Without loss of generality, we can
assume that

$\min_{w \in C} \|z^q - Uw\|_2^2 \ge \min_{w \in C} \|z^q - U^{-i} w\|_2^2$

and the minima are attained at $w$ and $w^{-i}$ respectively. Due to the definition, it follows that

$\beta \le \sup_{q \in Q,\ \|z^q\|_2 \le \rho} \left( \|z^q - U w^{-i}\|_2^2 - \|z^q - U^{-i} w^{-i}\|_2^2 \right).$

Expanding the term on the left, and using matrix norm inequalities, one obtains

$\beta \le \sup_{q \in Q} \left( (2 \|U\| \|\Delta\| + \|\Delta\|^2) \|w^{-i}\|_2^2 + 2 \|z^q\|_2 \|\Delta\| \|w^{-i}\|_2 \right) \qquad (17)$

where $\Delta = U - U^{-i}$. The proof follows from the following facts

• $\|U\| \le \sqrt{K} c$ due to each $\|u_k\|_2 \le c$ and that $\|U\| \le \|U\|_F$, where $\|\cdot\|_F$ denotes the
Frobenius norm.

• $\|w\|_2 \le \|w\|_1$ for $w \ge 0$

• $\|z^q\|_2 \le \rho$ due to the normalization

and that $\|\Delta\| \le s_{\max}$ by definition.
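
As a sanity check on the algebra, the sketch below evaluates the right-hand side of (16) and compares it with (17) after substituting the norm bounds listed above together with $\|w\|_1 \le \lambda_u$. The function names and the numerical values are arbitrary choices for illustration.

```python
import math

def stability_bound(s_max, lam_u, rho, K, c):
    """Right-hand side of (16)."""
    return 2 * s_max * lam_u * (rho + math.sqrt(K) * c * lam_u) + (s_max * lam_u) ** 2

def expanded_bound(s_max, lam_u, rho, K, c):
    """Right-hand side of (17) with ||U|| <= sqrt(K) c, ||w||_2 <= lam_u,
    ||z||_2 <= rho and ||Delta|| <= s_max substituted."""
    u_norm = math.sqrt(K) * c
    return (2 * u_norm * s_max + s_max ** 2) * lam_u ** 2 + 2 * rho * s_max * lam_u

params = dict(s_max=0.1, rho=2.0, K=4, c=1.0)
for lam_u in (0.5, 1.0, 2.0):
    b16 = stability_bound(lam_u=lam_u, **params)
    b17 = expanded_bound(lam_u=lam_u, **params)
    print(lam_u, b16, math.isclose(b16, b17))
```

The two expressions agree, and the bound grows with $\lambda_u$ for fixed $s_{\max}$.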

It is more interesting to study the bound on $s_{\max}$. We conjecture that this will depend on
the sample size as well as the nature of the proposed conic regularization. However, this is still
an open question and such an analysis is beyond the scope of the current work.

We note importantly that the stability bound can be made small by lowering $\lambda_u$. Doing
so improves stability at the cost of making the empirical risk large, and hence the bias
becomes undesirably large. In practice, it is important to select proper values of the
parameters to provide an optimal bias-variance trade-off. Next, we turn to the practical
implementation of the ideas, taking into account the large-scale nature of the problem.

4 Implementation



In the original formulation (9), the scaling ambiguity is resolved by placing a norm constraint
on $u_k$. However, a direct implementation seems difficult. In what follows, we propose an
alternative implementation by resolving the ambiguity on $w$ instead. We fix $\|w\|_1 = 1$ and
consider the norm inequality constraint on $u_k$ as $\|u_k\|_2 \le c$ (i.e. a convex relaxation of the equality
constraint), where $c$ is a constant of $O(\|z^q\|_2)$. This leads to an approximate formulation

$\min_{U, \{w_j^{q_i}\}} \ \frac{1}{P} \sum_{i=1}^{P} \frac{1}{n_{q_i}} \sum_{j=1}^{n_{q_i}} \phi_j^{q_i} \|z_j^{q_i} - U w_j^{q_i}\|_2^2$
s.t. $\|u_k\|_2 \le c$, $w_j^{q_i} \ge 0$, $\|w_j^{q_i}\|_1 = 1$.

The advantage of this approximation is that the optimization problem is now convex with
respect to each $u_k$ and still convex with respect to each $w_j^q$. This suggests an alternating
and iterative algorithm, where we only vary a subset of variables and fix the rest. The objective
function should then always decrease. As the problem is not strictly convex, there is no
guarantee of a global solution. Nevertheless, a locally optimal solution can be obtained. The
additional advantage of the formulation is that gradient-based methods can be used for each
sub-problem, which is very important in large-scale problems.

Algorithm 1 Stochastic Gradient Descent
Input: queries $q_i$ and pair differences $z_j^{q_i}$.
Randomly initialize $u_k$, $\forall k \le K$; set $\mu > 0$
repeat
    1. The folding-in step (fixed $U$):
    Randomly initialize $w_j^q$: $w_j^q \ge 0$; $\|w_j^q\|_1 = 1$;
    repeat
        1a. Compute $w_j^q \leftarrow w_j^q - \mu \, \partial R(w_j^q) / \partial w_j^q$
        1b. Set $w_j^q \leftarrow \max\{w_j^q, 0\}$ (element-wise)
        1c. Normalize $w_j^q \leftarrow w_j^q / \|w_j^q\|_1$
    until converged
    2. The basis-update step (fixed $w$):
    for $k = 1$ to $K$ do
        2a. Update $u_k \leftarrow u_k - \mu \, \partial R(u_k) / \partial u_k$
        2b. Normalize $u_k$ to norm $c$ if violated.
    end for
until converged



4.1 Stochastic Gradient 

Since the number of pairs may be large for typical real datasets, we do not want to store
every $w_j^q$. Instead, for each iteration, we perform a folding-in operation, in which we fix the basis
$U$ and estimate the coefficients $w_j^q$. Since this is a convex problem, it is possible to apply the
stochastic gradient (SG) method as shown in Algorithm 1. Note that, for notational simplicity,
we express the empirical risk as a function of only the variable of interest when the other
variables are fixed. In practice, we also need to check if the cone is proper, and we find this is always
satisfied.
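
The alternation in Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not in the original: the function names, fixed iteration counts instead of convergence tests, a uniform rather than random initialization of the coefficients, unit relevance weights, and a Gaussian initialization of the basis.

```python
import random

def reconstruct(U, w):
    """Return U w, where U is stored as a list of K basis vectors."""
    return [sum(w[k] * U[k][n] for k in range(len(U))) for n in range(len(U[0]))]

def fold_in(z, U, mu=0.05, steps=200):
    """Step 1 (fixed U): estimate w >= 0 with ||w||_1 = 1."""
    K = len(U)
    w = [1.0 / K] * K  # assumption: uniform start instead of a random one
    for _ in range(steps):
        resid = [r - t for r, t in zip(reconstruct(U, w), z)]
        grad = [2 * sum(a * b for a, b in zip(U[k], resid)) for k in range(K)]
        w = [max(0.0, w[k] - mu * grad[k]) for k in range(K)]       # steps 1a and 1b
        total = sum(w)
        w = [v / total for v in w] if total > 0 else [1.0 / K] * K  # step 1c
    return w

def clip_norm(u, c):
    """Step 2b: rescale u to norm c only if ||u||_2 > c."""
    norm = sum(v * v for v in u) ** 0.5
    return [v * c / norm for v in u] if norm > c else u

def train(pairs, K, c, mu=0.05, epochs=30, seed=0):
    """Alternate folding-in and basis updates over normalized pair differences."""
    rng = random.Random(seed)
    N = len(pairs[0])
    U = [clip_norm([rng.gauss(0, 1) for _ in range(N)], c) for _ in range(K)]
    for _ in range(epochs):
        order = list(range(len(pairs)))
        rng.shuffle(order)
        for j in order:
            z = pairs[j]
            w = fold_in(z, U, mu)  # step 1
            for k in range(K):     # step 2 (fixed w)
                resid = [r - t for r, t in zip(reconstruct(U, w), z)]
                # gradient of ||z - U w||^2 with respect to u_k is 2 * w_k * (U w - z)
                U[k] = clip_norm([U[k][n] - mu * 2 * w[k] * resid[n] for n in range(N)], c)
    return U

pairs = [[1.0, 0.2], [0.8, 0.5], [0.6, 0.6], [0.9, 0.1]]
U = train(pairs, K=2, c=2.0)
print(U)
```

Because the coefficients are re-estimated by folding-in whenever a pair is visited, only the basis has to be kept in memory between iterations, which is the point made above about not storing every $w_j^q$.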
4.2 Exponentiated Gradient

Exponentiated Gradient (EG) [12] is an algorithm for estimating distribution-like parameters.
Thus, Step 1a can be replaced by

$w_j^q \leftarrow w_j^q \exp\{-\mu \, \partial R(w_j^q) / \partial w_j^q\}$ (element-wise).

For faster numerical computation (by avoiding the exponential), as shown in [12], this step can
be approximated by




where the empirical risk $R$ is parameterized in terms of $z' = U w_j^q$. When the learning rate $\mu$
is sufficiently small, this update readily ensures the normalization of $w_j^q$. The main difference
between SG and EG is that the update in SG is additive, while it is multiplicative in EG.
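
For comparison with the additive step, the exact multiplicative update can be sketched as below. The function name `eg_step` is hypothetical, and the explicit renormalization stands in for Step 1c; the approximate variant from [12] is not reproduced here.

```python
import math

def eg_step(w, grad, mu):
    """One exponentiated-gradient update of w followed by l1 renormalization.

    w    : current non-negative coefficients with sum 1
    grad : gradient of the empirical risk with respect to w
    mu   : learning rate
    """
    unnormalized = [w_k * math.exp(-mu * g_k) for w_k, g_k in zip(w, grad)]
    total = sum(unnormalized)
    return [v / total for v in unnormalized]

w = eg_step([0.5, 0.5], grad=[1.0, -1.0], mu=0.1)
print(w)  # mass moves toward the coordinate with the negative gradient
```

Since each coefficient is multiplied by a positive factor, non-negativity is preserved automatically and the clipping of Step 1b is not needed.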

Algorithm 2 Query-level Prediction
Input: New query $q$ with pair differences $\{z_j^q\}_{j=1}^{n_q}$
Maintain a scoring array $A$ of all pre-computed feature vectors, initialize $A_l = 0$ for all $l$.
Set $\phi_j^q = 1$, $\forall j \le n_q$.
for $j = 1$ to $n_q$ do
    Perform folding-in to estimate the coefficients without the non-negativity constraints.
    Check if the sum of the coefficients is positive, then $A_l \leftarrow A_l + 1$; otherwise $A_m \leftarrow A_m + 1$
end for
Output the ranking based on the scoring array $A$.



4.3 Prediction 

Assume that the basis $U = (u_1, u_2, \ldots, u_K)$ has been learned during training. In testing, for
each query, we are also given a set of feature vectors, and we need to compute a ranking function
that outputs the appropriate positions of the vectors in the list.

Unlike the training data where the order of the pair $(l, m)$ is given, now this order information
is missing. This breaks down the conic assumption, namely that the difference of the two vectors
is a non-negative combination of the basis vectors. Since either preference order can
potentially be incorrect, we relax the constraint of non-negative coefficients. The idea is
that, if the order is correct, then the coefficients are mostly positive. On the other hand, if the
order is incorrect, we should expect that the coefficients are mostly negative. The query-level
prediction is proposed as shown in Algorithm 2. As this query-level prediction is performed
over a query, it can address the shortcoming of logical discrepancy of document-level prediction
in the pairwise approach.
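
The voting scheme of Algorithm 2 can be sketched as follows. The unconstrained folding-in is solved here by plain gradient descent on the least-squares objective; this solver, the step size, and the names `fold_in_unconstrained` and `rank_query` are assumptions made for the example.

```python
def fold_in_unconstrained(z, U, mu=0.05, steps=500):
    """Least-squares coefficients for z ~ U w, with no sign constraint on w."""
    K = len(U)
    w = [0.0] * K
    for _ in range(steps):
        recon = [sum(w[k] * U[k][n] for k in range(K)) for n in range(len(z))]
        resid = [r - t for r, t in zip(recon, z)]
        w = [w[k] - mu * 2 * sum(a * b for a, b in zip(U[k], resid)) for k in range(K)]
    return w

def rank_query(features, U):
    """Return document indices ordered from most to least preferred."""
    votes = [0] * len(features)  # the scoring array A
    for l in range(len(features)):
        for m in range(l + 1, len(features)):
            z = [a - b for a, b in zip(features[l], features[m])]
            if sum(fold_in_unconstrained(z, U)) > 0:
                votes[l] += 1  # coefficients mostly positive: l above m
            else:
                votes[m] += 1  # otherwise: m above l
    return sorted(range(len(features)), key=lambda i: -votes[i])

U = [[1.0, 0.0], [0.0, 1.0]]  # toy basis, assumed already learned
docs = [[0.1, 0.1], [0.9, 0.8], [0.5, 0.4]]
print(rank_query(docs, U))
```

Every document receives one vote per pairwise comparison it wins, so the final ordering is obtained from a single query-level score per document rather than from individual pairwise decisions.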

5 Discussion 

RankSVM [11] defines the following loss function over ordered pair differences

$L(u) = \frac{1}{P} \sum_j \max(0, 1 - u^\top z_j) + \frac{C}{2} \|u\|_2^2,$

where $u \in \mathbb{R}^N$ is the parameter vector, $C > 0$ is the penalty constant and $P$ is the number of
data pairs.

Being a pairwise approach, RankNet instead uses

$L(u) = \frac{1}{P} \sum_j \log(1 + \exp\{-u^\top z_j\}) + \frac{C}{2} \|u\|_2^2.$

j 

This is essentially the 1-class SVM applied over the ordered pair differences. The quadratic 
regularization term tends to push the separating hyperplane away from the origin, i.e. maxi- 
mizing the 1-class margin. 

It can be seen that the RankSVM solution is the special case when the cone approaches a 
halfspace. In the original RankSVM algorithm, there is no intention to learn a non-negative 
subspace where ordinal information is to be found like in the case of ConeRank. This could 
potentially give ConeRank more analytical power to trace the origin of preferences. 

6 Experiments 

6.1 Data and Settings 

We run the proposed algorithm on the latest and largest benchmark data LETOR 4.0. This 
has two data sets for supervised learning, namely MQ2007 (1700 queries) and MQ2008 (800 
queries). Each returned document is assigned an integer-valued relevance score of {0, 1, 2} where
0 means that the document is irrelevant with respect to the query. For each query-document
pair, a vector of 46 features is pre-extracted and available in the datasets. Example features
include the term frequency and the inverse document frequency in the body text, the title or the
anchor text, as well as link-specific features such as the PageRank and the number of in-links. The data is
split into a training set, a validation set and a test set. We normalize these features so that they
are roughly distributed as Gaussian with zero mean and unit standard deviation. During the
folding-in step, the parameters $w_j^q$ corresponding to the $j$th pair of query $q$ are randomly initialized
from the non-negative uniform distribution and then normalized so that $\|w_j^q\|_1 = 1$. The basis
vectors $u_k$ are randomly initialized to satisfy the relaxed norm constraint. The learning rate
is $\mu = 0.001$ for the SG and $\mu = 0.005$ for the EG. For normalization, we select $\alpha = 1$ and
$\rho = \sqrt{N}$ where $N$ is the number of features, and we set $c = 2\rho$.

6.2 Results 

The two widely-used evaluation metrics employed are the Mean Average Precision (MAP) and 
the Normalized Discounted Cumulative Gain (NDCG). We use the evaluation scripts distributed 
with LETOR 4.0. 

In the first experiment, we investigate the performance of the proposed method with respect 
to the number of basis vectors K. The result of this experiment on the MQ2007 dataset is 
shown in Fig. 3. We note an interesting observation that the performance is highest at about
$K = 10$ out of 46 dimensions of the original feature space. This seems to suggest that the idea
of capturing an informative subspace using the cone makes sense on this dataset. Furthermore,
the study of the eigenvalue distribution of the non-centralized ordered pairwise differences
on the MQ2007 dataset, as shown in Fig. 4, also reveals that this is about the dimension that
can capture most of the data energy.

We then compare the proposed method with recent baseline methods (see footnote 2) in the literature; the results
on the MQ2007 and MQ2008 datasets are shown in Table 1. The proposed ConeRank is studied
with $K = 10$ due to the previous experiment. We note that all methods tend to perform better
on MQ2007 than MQ2008, which can be explained by the fact that the MQ2007 dataset is
much larger than the other, and hence provides better training.

2 From http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4baseline.aspx

Figure 3: Performance versus basis number

On the MQ2007 dataset, ConeRank compares favourably with other methods. For example,
ConeRank-SG achieves the highest MAP score, whilst its NDCG score differs by less than
2% from the best (RankSVM-struct). On the MQ2008 dataset, ConeRank still
stays within a 3% margin of the best methods on both MAP and NDCG metrics.



Table 1: Results on LETOR 4.0.

Figure 4: Eigenvalue distribution on the MQ2007 dataset.



7 Conclusion 

We have presented a new view on the learning to rank problem from a generalized inequali- 
ties perspective. We formulate the problem as learning a polyhedral cone that uncovers the 
non-negative subspace where ordinal information is found. A practical implementation of the 
method is suggested which is then observed to achieve comparable performance to state-of-the- 
art methods on the LETOR 4.0 benchmark data. 

There are some directions that require further research, including a more rigorous study 
on the bound of the spectral norm of the leave-one-query-out basis vector difference matrix, a 
better optimization scheme that solves the original formulation without relaxation, and a study 
on the informative dimensionality of the ranking problem. 